Problem Statement¶

The objective of this report is to classify the species of an Iris flower from the dimensions of its sepals and petals.¶

Getting Data¶

In [293]:
from sklearn.datasets import load_iris
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly
import plotly.express as px
from mpl_toolkits.mplot3d import Axes3D
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
plotly.offline.init_notebook_mode()
In [294]:
iris = load_iris(as_frame=True)
print(iris)
{'data':      sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                  5.1               3.5                1.4               0.2
1                  4.9               3.0                1.4               0.2
2                  4.7               3.2                1.3               0.2
3                  4.6               3.1                1.5               0.2
4                  5.0               3.6                1.4               0.2
..                 ...               ...                ...               ...
145                6.7               3.0                5.2               2.3
146                6.3               2.5                5.0               1.9
147                6.5               3.0                5.2               2.0
148                6.2               3.4                5.4               2.3
149                5.9               3.0                5.1               1.8

[150 rows x 4 columns], 'target': 0      0
1      0
2      0
3      0
4      0
      ..
145    2
146    2
147    2
148    2
149    2
Name: target, Length: 150, dtype: int32, 'frame':      sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)   
0                  5.1               3.5                1.4               0.2  \
1                  4.9               3.0                1.4               0.2   
2                  4.7               3.2                1.3               0.2   
3                  4.6               3.1                1.5               0.2   
4                  5.0               3.6                1.4               0.2   
..                 ...               ...                ...               ...   
145                6.7               3.0                5.2               2.3   
146                6.3               2.5                5.0               1.9   
147                6.5               3.0                5.2               2.0   
148                6.2               3.4                5.4               2.3   
149                5.9               3.0                5.1               1.8   

     target  
0         0  
1         0  
2         0  
3         0  
4         0  
..      ...  
145       2  
146       2  
147       2  
148       2  
149       2  

[150 rows x 5 columns], 'target_names': array(['setosa', 'versicolor', 'virginica'], dtype='<U10'), 'DESCR': '.. _iris_dataset:\n\nIris plants dataset\n--------------------\n\n**Data Set Characteristics:**\n\n    :Number of Instances: 150 (50 in each of three classes)\n    :Number of Attributes: 4 numeric, predictive attributes and the class\n    :Attribute Information:\n        - sepal length in cm\n        - sepal width in cm\n        - petal length in cm\n        - petal width in cm\n        - class:\n                - Iris-Setosa\n                - Iris-Versicolour\n                - Iris-Virginica\n                \n    :Summary Statistics:\n\n    ============== ==== ==== ======= ===== ====================\n                    Min  Max   Mean    SD   Class Correlation\n    ============== ==== ==== ======= ===== ====================\n    sepal length:   4.3  7.9   5.84   0.83    0.7826\n    sepal width:    2.0  4.4   3.05   0.43   -0.4194\n    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)\n    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)\n    ============== ==== ==== ======= ===== ====================\n\n    :Missing Attribute Values: None\n    :Class Distribution: 33.3% for each of 3 classes.\n    :Creator: R.A. Fisher\n    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)\n    :Date: July, 1988\n\nThe famous Iris database, first used by Sir R.A. Fisher. The dataset is taken\nfrom Fisher\'s paper. Note that it\'s the same as in R, but not as in the UCI\nMachine Learning Repository, which has two wrong data points.\n\nThis is perhaps the best known database to be found in the\npattern recognition literature.  Fisher\'s paper is a classic in the field and\nis referenced frequently to this day.  (See Duda & Hart, for example.)  The\ndata set contains 3 classes of 50 instances each, where each class refers to a\ntype of iris plant.  
One class is linearly separable from the other 2; the\nlatter are NOT linearly separable from each other.\n\n.. topic:: References\n\n   - Fisher, R.A. "The use of multiple measurements in taxonomic problems"\n     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to\n     Mathematical Statistics" (John Wiley, NY, 1950).\n   - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.\n     (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.\n   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System\n     Structure and Classification Rule for Recognition in Partially Exposed\n     Environments".  IEEE Transactions on Pattern Analysis and Machine\n     Intelligence, Vol. PAMI-2, No. 1, 67-71.\n   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions\n     on Information Theory, May 1972, 431-433.\n   - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II\n     conceptual clustering system finds 3 classes in the data.\n   - Many, many more ...', 'feature_names': ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'], 'filename': 'iris.csv', 'data_module': 'sklearn.datasets.data'}
In [295]:
type(iris)
Out[295]:
sklearn.utils._bunch.Bunch
In [296]:
iris.keys()
Out[296]:
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])
In [297]:
iris['feature_names']
Out[297]:
['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']
In [298]:
iris['target_names']
Out[298]:
array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
In [299]:
iris_df = iris['frame']
iris_df
Out[299]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) target
0 5.1 3.5 1.4 0.2 0
1 4.9 3.0 1.4 0.2 0
2 4.7 3.2 1.3 0.2 0
3 4.6 3.1 1.5 0.2 0
4 5.0 3.6 1.4 0.2 0
... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 2
146 6.3 2.5 5.0 1.9 2
147 6.5 3.0 5.2 2.0 2
148 6.2 3.4 5.4 2.3 2
149 5.9 3.0 5.1 1.8 2

150 rows × 5 columns

Exploratory Data Analysis¶

In [300]:
# Review the top 5 records of the data frame
iris_df.head()
Out[300]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) target
0 5.1 3.5 1.4 0.2 0
1 4.9 3.0 1.4 0.2 0
2 4.7 3.2 1.3 0.2 0
3 4.6 3.1 1.5 0.2 0
4 5.0 3.6 1.4 0.2 0
In [301]:
iris_df['target'] = np.where(iris_df['target'] == 2, 'virginica', 'non-virginica')
In [302]:
iris_df.head()
Out[302]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) target
0 5.1 3.5 1.4 0.2 non-virginica
1 4.9 3.0 1.4 0.2 non-virginica
2 4.7 3.2 1.3 0.2 non-virginica
3 4.6 3.1 1.5 0.2 non-virginica
4 5.0 3.6 1.4 0.2 non-virginica
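
Since the binarization collapses setosa and versicolor into one class, the resulting class balance should be exactly 100 non-virginica to 50 virginica by construction. A minimal standalone sanity check (it reloads the frame so it can run on its own):

```python
from sklearn.datasets import load_iris
import numpy as np

iris_df = load_iris(as_frame=True)['frame']
iris_df['target'] = np.where(iris_df['target'] == 2, 'virginica', 'non-virginica')

# Class balance of the binarized target
counts = iris_df['target'].value_counts()
print(counts.to_dict())
```
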
In [303]:
# Review column names, data types and the total record count
# Note: DataFrame.info() prints its report and returns None, so there is
# no need to capture and print its return value
iris_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
 4   target             150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
In [304]:
# Check if there are any null values in dataset
df_is_null = iris_df.isnull().sum()
print(df_is_null)
sepal length (cm)    0
sepal width (cm)     0
petal length (cm)    0
petal width (cm)     0
target               0
dtype: int64
In [305]:
# descriptive statistics for each of the two classes.
df_stats = iris_df.groupby('target').describe()
df_stats
Out[305]:
               sepal length (cm)                               sepal width (cm)  ... petal length (cm)  petal width (cm)
               count  mean   std       min  25%    50%  75%  max  count  mean   ...    75%    max       count  mean   std       min  25%  50%  75%  max
target
non-virginica  100.0  5.471  0.641698  4.3  5.000  5.4  5.9  7.0  100.0  3.099  ...  4.325    5.1      100.0  0.786  0.565153  0.1  0.2  0.8  1.3  1.8
virginica       50.0  6.588  0.635880  4.9  6.225  6.5  6.9  7.9   50.0  2.974  ...  5.875    6.9       50.0  2.026  0.274650  1.4  1.8  2.0  2.3  2.5

2 rows × 32 columns
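The group statistics above can be spot-checked directly; for instance, the per-class mean petal width should reproduce the 0.786 / 2.026 values in the table. A standalone sketch that rebuilds the frame:

```python
from sklearn.datasets import load_iris
import numpy as np

iris_df = load_iris(as_frame=True)['frame']
iris_df['target'] = np.where(iris_df['target'] == 2, 'virginica', 'non-virginica')

# Mean petal width per class, matching the describe() table above
means = iris_df.groupby('target')['petal width (cm)'].mean()
print(means.round(3).to_dict())
```
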

Visualization¶

Histograms¶

In [306]:
plt.figure(figsize=(15, 10))
for i, col in enumerate(iris_df.columns, start=1):
    # Skip the 'target' column
    if col == 'target':
        continue
    
    # Create a subplot for each column
    plt.subplot(2, 2, i)
    
    # Create the histogram with hue='target'
    sns.histplot(data=iris_df, x=col, hue='target', kde=True)
    
    # Set the title
    plt.title('Histogram of {}'.format(col))

# Show the plot
plt.tight_layout()
plt.show()

Correlation Matrix¶

In [307]:
corr_matrix = iris_df.drop('target', axis=1).corr()
plt.figure(figsize=(10,8))
sns.heatmap(corr_matrix, annot=True, cmap='Blues', fmt=".2f")
plt.title('Correlation Matrix')
plt.show()

Pair Plot¶

Shows the bivariate relation between each pair of features

Reference : Python Data Visualizations

In [308]:
sns.pairplot(iris_df, hue="target", height=3, diag_kind="kde")
Out[308]:
<seaborn.axisgrid.PairGrid at 0x1f7a0918d90>

Box Plot¶

Shows the distribution of data and skewness

Reference : Python Data Visualizations

In [309]:
iris_df.boxplot(by="target", figsize=(12, 6))
Out[309]:
array([[<Axes: title={'center': 'petal length (cm)'}, xlabel='[target]'>,
        <Axes: title={'center': 'petal width (cm)'}, xlabel='[target]'>],
       [<Axes: title={'center': 'sepal length (cm)'}, xlabel='[target]'>,
        <Axes: title={'center': 'sepal width (cm)'}, xlabel='[target]'>]],
      dtype=object)

3D Scatter Plot¶

Shows a three-dimensional view of three of the features

Reference : 3D plots

In [310]:
fig = px.scatter_3d(iris_df, x='sepal length (cm)', y='sepal width (cm)',
                    z='petal length (cm)', color='target')
fig.show()

Insights:¶

  • The non-virginica class (setosa and versicolor combined) is twice as prevalent as virginica: 100 vs. 50 instances, by construction of the binary target
  • The correlation matrix reveals that petal length and petal width have strong positive correlations with sepal length, and moderate negative correlations with sepal width
  • The histograms show that virginica tends to have larger petal lengths and widths than non-virginica
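
The correlation claims above can be confirmed numerically on the raw feature frame (a standalone sketch):

```python
from sklearn.datasets import load_iris

X = load_iris(as_frame=True)['data']
corr = X.corr()

# Petal length vs. sepal length (strong, positive) and vs. sepal width
# (moderate, negative)
print(corr.loc['petal length (cm)', 'sepal length (cm)'].round(2))
print(corr.loc['petal length (cm)', 'sepal width (cm)'].round(2))
```
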

Logistic Regression¶

Data Split to Train, Test and Validation¶

In [311]:
X = iris_df.drop(['target'], axis=1)
y = iris_df['target'] == "virginica"
In [312]:
train_ratio = .81
test_ratio = .10
validation_ratio = .10

models = {}
for i in range(1, 5):
    X_i = X.iloc[:, :i]

    # Split the data into train, test and validation sets
    X_train, X_test, y_train, y_test = train_test_split(X_i, y.values.ravel(), test_size=test_ratio, random_state=42) 
    X_train, X_val , y_train, y_val = train_test_split(X_train, y_train, test_size=validation_ratio/(train_ratio+test_ratio), random_state=42)

    # Create and train the logistic regression model with 'i' features
    model = LogisticRegression(random_state=42)
    model.fit(X_train, y_train.ravel())
    models[i] = model

print(models)
{1: LogisticRegression(random_state=42), 2: LogisticRegression(random_state=42), 3: LogisticRegression(random_state=42), 4: LogisticRegression(random_state=42)}
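
With these ratios the two-stage split leaves 120 / 15 / 15 rows for train / validation / test. A standalone sketch mirroring the split above (same `test_size` expressions and `random_state=42`):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris(as_frame=True)
X, y = iris['data'], (iris['target'] == 2).values

# Same two-stage split as above: carve out the 10% test set first, then a
# second hold-out of the remainder for validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.10 / (0.81 + 0.10), random_state=42)
print(len(X_train), len(X_val), len(X_test))
```
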

Model Evaluation¶

Prediction on validation set¶

In [313]:
metrics = []

for i, model in models.items():
    # X_val comes from the final (i=4) split, but the fixed random_state means
    # the same rows were held out for every i, so slicing its columns is safe
    X_val_i = X_val.iloc[:, :i]

    # Use the model to predict the probabilities and the classes
    proba_val = model.predict_proba(X_val_i)[:, 1]  # Probability of 'virginica'
    pred_val = model.predict(X_val_i)  # Predicted class
    print(f'\nModel : {model} with {i} features')
  

    table = pd.DataFrame({
        'Instance number': X_val_i.index,
        'Probability of virginica': proba_val,
        'Model prediction': pred_val,
        'Ground truth': y_val})

    print(table)

    print(f"Summary of the model with {i} features on the validation set:")
    # Calculate the accuracy of the model on the test set
    accuracy_val = np.mean(pred_val == y_val)
    print(f"Accuracy : {accuracy_val}")

    # Calculate the log loss
    logloss_val = log_loss(y_val, proba_val)
    print(f"Log loss of the model : {logloss_val}")
    metrics.append({'Model': f'Model Feature={i}',
            'Log-Loss': logloss_val})

   
Model : LogisticRegression(random_state=42) with 1 features
    Instance number  Probability of virginica  Model prediction  Ground truth
0               115                  0.576720              True          True
1                35                  0.064688             False         False
2                91                  0.418391             False         False
3                42                  0.018915             False         False
4                40                  0.064688             False         False
5                53                  0.167034             False         False
6               143                  0.761502              True          True
7                88                  0.198786             False         False
8                87                  0.524087              True         False
9               149                  0.319692             False          True
10               60                  0.064688             False         False
11               67                  0.275261             False         False
12               86                  0.720719              True         False
13              139                  0.797999              True          True
14                1                  0.052939             False         False
Summary of the model with 1 features on the validation set:
Accuracy : 0.8
Log loss of the model : 0.3832909328304838

Model : LogisticRegression(random_state=42) with 2 features
    Instance number  Probability of virginica  Model prediction  Ground truth
0               115                  0.527268              True          True
1                35                  0.054036             False         False
2                91                  0.413153             False         False
3                42                  0.015733             False         False
4                40                  0.041985             False         False
5                53                  0.267725             False         False
6               143                  0.722770              True          True
7                88                  0.195875             False         False
8                87                  0.666397              True         False
9               149                  0.315295             False          True
10               60                  0.141545             False         False
11               67                  0.326790             False         False
12               86                  0.697266              True         False
13              139                  0.778828              True          True
14                1                  0.052245             False         False
Summary of the model with 2 features on the validation set:
Accuracy : 0.8
Log loss of the model : 0.42936650538756177

Model : LogisticRegression(random_state=42) with 3 features
    Instance number  Probability of virginica  Model prediction  Ground truth
0               115                  0.779572              True          True
1                35                  0.000005             False         False
2                91                  0.290399             False         False
3                42                  0.000009             False         False
4                40                  0.000006             False         False
5                53                  0.090373             False         False
6               143                  0.959330              True          True
7                88                  0.082049             False         False
8                87                  0.225165             False         False
9               149                  0.706750              True          True
10               60                  0.025216             False         False
11               67                  0.090214             False         False
12               86                  0.304164             False         False
13              139                  0.814903              True          True
14                1                  0.000012             False         False
Summary of the model with 3 features on the validation set:
Accuracy : 1.0
Log loss of the model : 0.14023623382003317

Model : LogisticRegression(random_state=42) with 4 features
    Instance number  Probability of virginica  Model prediction  Ground truth
0               115                  0.901372              True          True
1                35                  0.000003             False         False
2                91                  0.212009             False         False
3                42                  0.000004             False         False
4                40                  0.000003             False         False
5                53                  0.079641             False         False
6               143                  0.976464              True          True
7                88                  0.061013             False         False
8                87                  0.169502             False         False
9               149                  0.716479              True          True
10               60                  0.017029             False         False
11               67                  0.039739             False         False
12               86                  0.253573             False         False
13              139                  0.881182              True          True
14                1                  0.000005             False         False
Summary of the model with 4 features on the validation set:
Accuracy : 1.0
Log loss of the model : 0.10051287143887465
In [314]:
metrics_summary = "| Model | Log-Loss |\n|-------|----------|\n"
for result in metrics:
    metrics_summary += "| {Model} | {Log-Loss} |\n".format(**result)

print(metrics_summary)
| Model | Log-Loss |
|-------|----------|
| Model Feature=1 | 0.3832909328304838 |
| Model Feature=2 | 0.42936650538756177 |
| Model Feature=3 | 0.14023623382003317 |
| Model Feature=4 | 0.10051287143887465 |
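
Log loss is the mean negative log of the probability the model assigns to the true class, which is why the 2-feature model's confident mistakes cost it more than its accuracy suggests. A hand-computed check against `sklearn.metrics.log_loss` on made-up probabilities (illustrative values only, not from this report):

```python
import numpy as np
from sklearn.metrics import log_loss

# Made-up probabilities of the positive class for three instances
y_true = [True, False, True]
proba = [0.9, 0.2, 0.6]

# Log loss = mean of -log(probability assigned to the true class);
# for the False instance, the true-class probability is 1 - 0.2
manual = -np.mean([np.log(0.9), np.log(1 - 0.2), np.log(0.6)])
print(round(manual, 4))
```
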

Plot the decision boundary¶

In [315]:
# Create a figure for the 1- and 2-feature decision boundaries
fig = plt.figure(figsize=(18, 6))

for i, model in models.items():
    if i == 1:
        # With one feature, the boundary is the single x where w*x + b = 0,
        # i.e. where P(virginica) = 0.5
        ax = fig.add_subplot(1, 2, 1)
        decision_boundary = (-model.intercept_ / model.coef_).item()
        ax.axvline(decision_boundary, color='black')
        ax.scatter(X_val.iloc[:, 0], np.zeros(len(X_val)), c=y_val, edgecolors='k', linewidth=1, alpha=0.6)
        ax.set_title('Decision Boundary for Model with 1 Feature')

    elif i == 2:
        # With two features, the boundary is the line w1*x1 + w2*x2 + b = 0
        ax = fig.add_subplot(1, 2, 2)
        x1_range = np.linspace(X_val.iloc[:, 0].min(), X_val.iloc[:, 0].max(), 100)
        x2_boundary = -(model.intercept_ + model.coef_[0][0] * x1_range) / model.coef_[0][1]
        ax.plot(x1_range, x2_boundary, color='black')
        ax.set_title('Decision Boundary for Model with 2 Features')

    elif i == 3:
        # With three features, the boundary is a plane; solve for x3
        x1, x2 = np.meshgrid(np.linspace(X_val.iloc[:, 0].min(), X_val.iloc[:, 0].max(), 10),
                             np.linspace(X_val.iloc[:, 1].min(), X_val.iloc[:, 1].max(), 10))
        x3 = (-model.intercept_ - model.coef_[0][0] * x1 - model.coef_[0][1] * x2) / model.coef_[0][2]
        df = pd.DataFrame({
            'X1': x1.flatten(),
            'X2': x2.flatten(),
            'Decision Boundary': x3.flatten()
        })
        # Create the 3D scatter plot of points on the boundary plane
        fig3d = px.scatter_3d(df, x='X1', y='X2', z='Decision Boundary')
        fig3d.update_layout(title_text='Decision Boundary for Model with 3 Features')
        fig3d.show()

plt.show()

Failure modes¶

The models with 1 and 2 features each made 3 incorrect predictions out of 15:

  • Instance 8: The model predicted ‘virginica’ (True), but the ground truth is ‘non-virginica’ (False).
  • Instance 9: The model predicted ‘non-virginica’ (False), but the ground truth is ‘virginica’ (True).
  • Instance 12: The model predicted ‘virginica’ (True), but the ground truth is ‘non-virginica’ (False).

The models with 3 and 4 features correctly predicted all 15 instances. This suggests that the additional features (petal length and petal width) provide information that helps the model separate the two classes.
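
Notably, the 1-feature model's errors involve probabilities near the 0.5 cutoff that `predict` applies to `predict_proba` (e.g. instance 8 at 0.524). A small sketch with the probabilities reported in that model's table above:

```python
import numpy as np

# Probabilities of 'virginica' for validation instances 8, 9 and 12,
# copied from the 1-feature model's table
proba = np.array([0.524087, 0.319692, 0.720719])

# predict() labels an instance 'virginica' when its probability reaches 0.5;
# instance 8 sits only just above that cutoff
print(proba >= 0.5)
```
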

Identify the best model¶

Based on accuracy and log loss, the model with 4 features is the best choice. Here's why:

It correctly predicted the entire validation set, unlike the models with 1 and 2 features; it uses the most information (all four features) to make its predictions; and it achieved the lowest log loss of the four models, approximately 0.1005 on the validation set.

Run the chosen model on the test set¶

To summarize the results of the best model on the test set, we use the same process as for the validation set.

In [316]:
 
X_test_i = X_test.iloc[:, :4]
proba = models[4].predict_proba(X_test_i)[:, 1] 
pred = models[4].predict(X_test_i)

# Calculate the accuracy of the model on the test set
accuracy_test = np.mean(pred == y_test)
print(f"Accuracy of the model with 4 features on the test set: {accuracy_test}")

# Calculate the log loss
logloss_test = log_loss(y_test, proba)
print(f"Log loss of the model with 4 features on the test set: {logloss_test}")

table = pd.DataFrame({
    'Instance number': X_test_i.index ,
    'Probability of virginica': proba,
    'Model prediction': pred, 
    'Ground truth': y_test})

print("Model with 4 features on test set:")
print(table)
Accuracy of the model with 4 features on the test set: 1.0
Log loss of the model with 4 features on the test set: 0.11796331617414953
Model with 4 features on test set:
    Instance number  Probability of virginica  Model prediction  Ground truth
0                73                  0.213315             False         False
1                18                  0.000007             False         False
2               118                  0.998688              True          True
3                78                  0.218776             False         False
4                76                  0.306342             False         False
5                31                  0.000007             False         False
6                64                  0.017760             False         False
7               141                  0.832739              True          True
8                68                  0.310019             False         False
9                82                  0.035147             False         False
10              110                  0.735148              True          True
11               12                  0.000005             False         False
12               36                  0.000002             False         False
13                9                  0.000005             False         False
14               19                  0.000004             False         False

Conclusion¶

The model with 4 features performed best. It achieved an accuracy of 1.0 on the test set, meaning it correctly classified every test instance, and it also achieved a low log loss of approximately 0.118 on the test set.